Computer-Implemented System And Method For Providing Visual Classification Suggestions For Inclusion-Based Concept Clusters

ABSTRACT

A computer-implemented system and method for providing visual classification suggestions for inclusion-based concept clusters are provided. Reference concepts each associated with a classification code are designated. One or more of the reference concepts are grouped with a plurality of uncoded concepts into a grouped concept set. Clusters are generated, each including a portion of the uncoded concepts and the reference concepts of the grouped concept set. A visual suggestion for assigning one of the classification codes to one of the clusters including visually representing each of the reference concepts in that cluster is provided.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application is a continuation of U.S. patent application Ser. No. 12/844,810, filed Jul. 27, 2010, pending, which is a non-provisional patent application which claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application Ser. No. 61/229,216, filed Jul. 28, 2009, and U.S. Provisional Patent Application Ser. No. 61/236,490, filed Aug. 24, 2009, the disclosures of which are incorporated by reference.

FIELD

This application relates in general to using documents as a reference point and, in particular, to a system and method for displaying relationships between concepts to provide classification suggestions via inclusion.

BACKGROUND

Historically, document review during the discovery phase of litigation and for other types of legal matters, such as due diligence and regulatory compliance, has been conducted manually. During document review, individual reviewers, generally licensed attorneys, are assigned sets of documents for coding. A reviewer must carefully study each document and categorize the document by assigning a code or other marker from a set of descriptive classifications, such as “privileged,” “responsive,” and “non-responsive.” The classifications can affect the disposition of each document, including admissibility into evidence.

During discovery, document review can potentially affect the outcome of the underlying legal matter, so consistent and accurate results are crucial. Manual document review is tedious and time-consuming. Marking documents is solely at the discretion of each reviewer and inconsistent results may occur due to misunderstanding, time pressures, fatigue, or other factors. A large volume of documents reviewed, often with only limited time, can create a loss of mental focus and a loss of purpose for the resultant classification. Each new reviewer also faces a steep learning curve to become familiar with the legal matter, classification categories, and review techniques.

Currently, with the increasingly widespread movement to electronically stored information (ESI), manual document review is no longer practicable. The often exponential growth of ESI exceeds the bounds reasonable for conventional manual human document review and underscores the need for computer-assisted ESI review tools.

Conventional ESI review tools have proven inadequate for providing efficient, accurate, and consistent results. For example, DiscoverReady LLC, a Delaware limited liability company, custom programs ESI review tools, which conduct semi-automated document review through multiple passes over a document set in ESI form. During the first pass, documents are grouped by category and basic codes are assigned. Subsequent passes refine and further assign codings. Multiple pass review requires a priori project-specific knowledge engineering, which is only useful for the single project, thereby losing the benefit of any inferred knowledge or know-how for use in other review projects.

Thus, there remains a need for a system and method for increasing the efficiency of document review that bootstraps knowledge gained from other reviews while ultimately ensuring independent reviewer discretion.

SUMMARY

Document review efficiency can be increased by identifying relationships between reference documents and uncoded documents and providing a suggestion for classification based on the relationships. The reference documents and uncoded documents are clustered based on a similarity of the documents. The clusters and the relationship between the uncoded documents and reference documents within the cluster are visually depicted. The visual relationship of the uncoded documents and reference documents provides a suggestion regarding classification for the uncoded documents.

One embodiment provides a computer-implemented system and method for providing visual classification suggestions for inclusion-based concept clusters. Reference concepts each associated with a classification code are designated. One or more of the reference concepts are grouped with a plurality of uncoded concepts into a grouped concept set. Clusters are generated, each including a portion of the uncoded concepts and the reference concepts of the grouped concept set. A visual suggestion for assigning one of the classification codes to one of the clusters including visually representing each of the reference concepts in that cluster is provided.

Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a system for displaying relationships between concepts to provide classification suggestions via inclusion, in accordance with one embodiment.

FIG. 2 is a process flow diagram showing a method for displaying relationships between concepts to provide classification suggestions via inclusion, in accordance with one embodiment.

FIG. 3 is a block diagram showing, by way of example, measures for selecting reference concept subsets for use in the method of FIG. 2.

FIG. 4 is a table showing, by way of example, a matrix mapping of uncoded concepts and documents.

FIG. 5 is a process flow diagram showing, by way of example, a method for forming clusters for use in the method of FIG. 2.

FIG. 6 is a screenshot showing, by way of example, a visual display of reference concepts in relation to uncoded documents.

FIG. 7A is a block diagram showing, by way of example, a cluster with “privileged” reference concepts and uncoded concepts.

FIG. 7B is a block diagram showing, by way of example, a cluster with “non-responsive” reference concepts and uncoded concepts.

FIG. 7C is a block diagram showing, by way of example, a cluster with uncoded concepts and a combination of differently classified reference concepts.

FIG. 8 is a process flow diagram showing, by way of example, a method for classifying uncoded concepts for use in the method of FIG. 2.

FIG. 9 is a screenshot showing, by way of example, a reference options dialogue box for entering user preferences for clustering concepts.

DETAILED DESCRIPTION

The ever-increasing volume of ESI underlies the need for automating document review for improved consistency and throughput. Token clustering via injection utilizes reference, or previously classified, tokens, which offer knowledge gleaned from earlier work in similar legal projects, as well as a reference point for classifying uncoded tokens.

The tokens can include word-level, symbol-level, or character-level n-grams, raw terms, entities, or concepts. Other tokens, including other atomic parse-level elements, are possible. An n-gram is a predetermined number of items selected from a source. The items can include syllables, letters, or words, as well as other items. A raw term is a term that has not been processed or manipulated. Entities further refine nouns and noun phrases into people, places, and things, such as meetings, animals, relationships, and various other objects. Additionally, entities can represent other parts of grammar associated with semantic meanings to disambiguate different instances or occurrences of the grammar. Entities can be extracted using entity extraction techniques known in the field. Concepts are collections of nouns and noun-phrases with common semantic meaning that can be extracted from ESI, including documents, through part-of-speech tagging. Each concept can represent one or more documents to be classified during a review. Clustering of the concepts provides an overall view of the document space, which allows users to easily identify documents sharing a common theme.
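By way of illustration only, the n-gram token described above can be sketched as follows. The function name `ngrams` and the sample phrase are hypothetical, and the sketch assumes the items of each n-gram are taken contiguously from the source, which the description does not require.

```python
def ngrams(items, n):
    """Return every run of n consecutive items from a source sequence.

    Assumption: items are contiguous; the description only requires a
    predetermined number of items selected from a source.
    """
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level n-grams: the items are words.
words = "wood composite harbors mold growth".split()
word_bigrams = ngrams(words, 2)

# Character-level n-grams: the items are letters.
char_trigrams = ngrams("mold", 3)
```

The same function serves word-level and character-level tokens because only the choice of item changes, not the selection logic.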

The clustering of tokens, for example, concepts, differs from document clustering, which groups related documents individually. In contrast, concept clustering groups related concepts, which are each representative of one or more related documents. Each concept can express an idea or topic that may not be expressed by individual documents. A concept is analogous to a search query by identifying documents associated with a particular idea or topic.

A user can determine how particular concepts are related based on the concept clustering. Further, users are able to intuitively identify documents by selecting one or more associated concepts in a cluster. For example, a user may wish to identify all documents in a particular corpus that are related to car manufacturing. The user can select the concept “car manufacturing” or “vehicle manufacture” within one of the clusters and subsequently, the associated documents are presented. However, during document clustering, a user is first required to select a specific document from which other documents that are similarly related can then be identified.

Providing Suggestions Using Reference Concepts

Reference concepts are previously classified based on the document content represented by that concept and can be injected into clusters of uncoded, that is unclassified, concepts to influence classification of the uncoded concepts. Specifically, relationships between an uncoded concept and the reference concepts, in terms of semantic similarity or distinction, can be used as an aid in providing suggestions for classifying uncoded concepts. Once classified, the newly-coded, or reference, concepts can be used to further classify the represented documents. Although tokens, such as word-level or character-level n-grams, raw terms, entities, or concepts, can be clustered and displayed, the discussion below will focus on a concept as a particular token.

Complete ESI review requires a support environment within which classification can be performed. FIG. 1 is a block diagram showing a system 10 for displaying relationships between concepts to provide classification suggestions via inclusion, in accordance with one embodiment. By way of illustration, the system 10 operates in a distributed computing environment, which includes a plurality of heterogeneous systems and ESI sources. Henceforth, a single item of ESI will be referenced as a “document,” although ESI can include other forms of non-document data, as described infra. A backend server 11 is coupled to a storage device 13, which stores documents 14 a in the form of structured or unstructured data, a database 30 for maintaining information about the documents, a lookup database 38 for storing many-to-many mappings 39 between documents and document features, and a concept document index 40, which maps documents to concepts. The storage device 13 also stores classified documents 14 b, concepts 14 c, and reference concepts 14 d. Concepts are collections of nouns and noun-phrases with common semantic meaning. The nouns and noun-phrases can be extracted from one or more documents in the corpus for review. Hereinafter, the terms “classified” and “coded” are used interchangeably with the same intended meaning, unless otherwise indicated. A set of reference concepts can be hand-selected or automatically selected through guided review, which is further discussed below. Additionally, the set of reference concepts can be predetermined or can be generated dynamically, as uncoded concepts are classified and subsequently added to the set of reference concepts.

The backend server 11 is coupled to an intranetwork 21 and executes a workbench suite 31 for providing a user interface framework for automated document management, processing, analysis, and classification. In a further embodiment, the backend server 11 can be accessed via an internetwork 22. The workbench software suite 31 includes a document mapper 32 that includes a clustering engine 33, similarity searcher 34, classifier 35, and display generator 36. Other workbench suite modules are possible.

The clustering engine 33 performs efficient document scoring and clustering of uncoded concepts and reference concepts, such as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. Clusters of uncoded concepts 14 c and reference concepts 14 d are formed and organized along vectors, known as spines, based on a similarity of the clusters. The similarity can be expressed in terms of distance. Concept clustering is further discussed below with reference to FIG. 5. The classifier 35 provides a machine-generated suggestion and confidence level for classification of selected uncoded concepts 14 c, clusters, or spines, as further described below with reference to FIG. 8.

The display generator 36 arranges the clusters and spines in thematic relationships in a two-dimensional visual display space, as further described below beginning with reference to FIG. 2. Once generated, the visual display space is transmitted to a work client 12 by the backend server 11 via the document mapper 32 for presenting to a reviewer on a display 37. The reviewer can include an individual person who is assigned to review and classify one or more uncoded concepts by designating a code. Hereinafter, the terms “reviewer” and “custodian” are used interchangeably with the same intended meaning, unless otherwise indicated. Other types of reviewers are possible, including machine-implemented reviewers.

The document mapper 32 operates on uncoded concepts 14 a, which can be retrieved from the storage 13, as well as from a plurality of local and remote sources. The local and remote sources can also store the reference documents 14 b, concepts 14 c, and reference concepts 14 d. The local sources include documents and concepts 17 maintained in a storage device 16 coupled to a local server 15, and documents and concepts 20 maintained in a storage device 19 coupled to a local client 18. The local server 15 and local client 18 are interconnected to the backend server 11 and the work client 12 over an intranetwork 21. In addition, the document mapper 32 can identify and retrieve concepts from remote sources over an internetwork 22, including the Internet, through a gateway 23 interfaced to the intranetwork 21. The remote sources include documents and concepts 26 maintained in a storage device 25 coupled to a remote server 24, and documents and concepts 29 maintained in a storage device 28 coupled to a remote client 27. Other document sources, either local or remote, are possible.

The individual documents 14 a, 14 b, 17, 20, 26, 29 include all forms and types of structured and unstructured ESI, including electronic message stores, word processing documents, electronic mail (email) folders, Web pages, and graphical or multimedia data. Notwithstanding, the documents could be in the form of structurally organized data, such as stored in a spreadsheet or database.

In one embodiment, the individual documents 14 a, 14 b, 17, 20, 26, 29 include electronic message folders storing email and attachments, such as maintained by the Outlook and Outlook Express products, licensed by Microsoft Corporation, Redmond, Wash. The database can be an SQL-based relational database, such as the Oracle database management system, Release 8, licensed by Oracle Corporation, Redwood Shores, Calif.

Additionally, the individual concepts 14 c, 14 d, 17, 20, 26, 29 include uncoded concepts 14 c and reference concepts 14 d. The uncoded concepts 14 c, which are unclassified, represent collections of nouns and noun-phrases that are semantically related and extracted from documents in a document review project. The reference concepts 14 d are initially uncoded concepts that can represent documents selected from the corpus or other sources of documents. The reference concepts 14 d assist in providing suggestions for classification of the remaining uncoded concepts representative of the document corpus based on visual relationships between the uncoded concepts and reference concepts. The reviewer can classify one or more of the remaining uncoded concepts by assigning a classification code based on the relationships. In a further embodiment, the reference concepts can be used as a training set to form machine-generated suggestions for classifying the remaining uncoded concepts, as further described below with reference to FIG. 8.

The concept corpus for a document review project can be divided into subsets of uncoded concepts, which are each provided to a particular reviewer as an assignment. The uncoded documents are analyzed to identify concepts, which are subsequently clustered. A classification code can be assigned to each of the clustered concepts. To maintain consistency, the same codes can be used across all concepts representing assignments in the document review project. The classification codes can be determined using taxonomy generation, during which a list of classification codes can be provided by a reviewer or determined automatically. The classification code of a concept can be assigned to the documents associated with that concept.

For purposes of legal discovery, the list of classification codes can include “privileged,” “responsive,” or “non-responsive”; however, other classification codes are possible. The assigned classification codes can be used as suggestions for classification of associated documents. For example, a document associated with three concepts, each assigned a “privileged” classification, can also be considered “privileged.” Other types of suggestions are possible. A “privileged” document contains information that is protected by a privilege, meaning that the document should not be disclosed or “produced” to an opposing party. Disclosing a “privileged” document can result in an unintentional waiver of the subject matter disclosed. A “responsive” document contains information that is related to the legal matter, while a “non-responsive” document includes information that is not related to the legal matter.
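One possible way to turn the codes of a document's associated concepts into a document-level suggestion is sketched below. The function name `suggest_document_code` and the majority-vote rule are assumptions made for illustration; the description only gives the case in which all three associated concepts are “privileged.”

```python
from collections import Counter


def suggest_document_code(concept_codes):
    """Suggest a code for a document from its associated concepts' codes.

    Assumption: a simple majority vote. Returns (code, share), where share
    is the fraction of coded concepts carrying the suggested code, or
    (None, 0.0) when none of the associated concepts is coded.
    """
    counts = Counter(code for code in concept_codes if code is not None)
    if not counts:
        return None, 0.0
    code, votes = counts.most_common(1)[0]
    return code, votes / sum(counts.values())


# A document associated with three "privileged" concepts.
unanimous = suggest_document_code(["privileged", "privileged", "privileged"])

# A document whose concepts carry different codes yields a weaker suggestion.
mixed = suggest_document_code(["privileged", "responsive", "privileged"])
```

The returned share is only a rough indicator of agreement among the concepts; the reviewer retains discretion over the final code.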

The system 10 includes individual computer systems, such as the backend server 11, work client 12, server 15, client 18, remote server 24 and remote client 27. The individual computer systems are general purpose, programmed digital computing devices consisting of a central processing unit (CPU), random access memory (RAM), non-volatile secondary storage, such as a hard drive or CD ROM drive, network interfaces, and peripheral devices, including user interfacing means, such as a keyboard and display. The various implementations of the source code and object and byte codes can be held on a computer-readable storage medium, such as a floppy disk, hard drive, digital video disk (DVD), random access memory (RAM), read-only memory (ROM) and similar storage mediums. For example, program code, including software programs, and data are loaded into the RAM for execution and processing by the CPU and results are generated for display, output, transmittal, or storage.

Identifying relationships between the reference concepts and uncoded concepts includes clustering. FIG. 2 is a process flow diagram showing a method 50 for displaying relationships between concepts to provide classification suggestions via inclusion, in accordance with one embodiment. A subset of reference concepts is identified and selected (block 51) from a representative set of reference concepts. The subset of reference concepts can be predefined, arbitrary, or specifically selected, as discussed further below with reference to FIG. 3. Upon identification, the reference concept subset is grouped with uncoded concepts (block 52). The uncoded concepts can include all uncoded concepts in an assignment or in a corpus. The grouped concepts, including uncoded and reference concepts, are organized into clusters (block 53). Clustering of the concepts is discussed further below with reference to FIG. 5.

Once formed, the clusters can be displayed to visually depict relationships (block 54) between the uncoded concepts and the reference concepts. The relationships can provide a suggestion, which can be used by an individual reviewer for classifying one or more of the uncoded concepts, clusters, or spines. Based on the relationships, the reviewer can classify the uncoded concepts, clusters, or spines by assigning a classification code, which can represent a relevancy of the uncoded concept to the document review project. Further, machine classification can provide a suggestion for classification, including a classification code, based on a calculated confidence level (block 55). Classifying uncoded concepts is further discussed below with reference to FIG. 8.

In one embodiment, the classified concepts can be used as suggestions for classifying those documents represented by that concept. For example, in a product liability lawsuit, the plaintiff claims that a wood composite manufactured by the defendant induces and harbors mold growth. During discovery, all documents within the corpus for the lawsuit and relating to mold should be identified for review. The concept for mold is clustered and includes a “responsive” classification code, which indicates that the noun phrase mold is related to the legal matter. Upon selection of the mold concept, all documents that include the noun phrase mold can be identified using the mapping matrix, which is described below with reference to FIG. 4. The responsive classification code assigned to the concept can be used as a suggestion for the document classification. However, if the document is represented by multiple concepts with different classification codes, each different code can be considered during classification of the document.

In a further embodiment, the concept clusters can be used with document clusters, which are described in commonly-owned U.S. Patent Application Publication No. 20110029526, published Feb. 3, 2011, pending, and U.S. Pat. No. 8,515,957, issued Aug. 20, 2013, the disclosures of which are incorporated by reference. For example, selecting a concept in the concept cluster display can identify one or more documents with a common idea or topic. Further selection of one of the documents represented by the selected cluster in the document concept display can identify documents that are similarly related to the content of the selected document. The identified documents can be the same as or different from the other documents represented by the concept.

Similar documents can also be identified as described in commonly-assigned U.S. Pat. No. 8,572,084, issued Oct. 29, 2013, the disclosure of which is incorporated by reference.

In an even further embodiment, the documents identified from one of the concepts can be classified automatically as described in commonly-assigned U.S. Pat. No. 8,635,223, issued Jan. 21, 2014, pending, the disclosure of which is incorporated by reference.

Identifying a Set and Subset of Reference Concepts

Prior to clustering, the uncoded concepts and reference concepts are obtained. The reference concepts used for clustering can include a particular subset of reference concepts, which are selected from a general set of reference concepts. Alternatively, the entire set of reference concepts can be clustered with the uncoded concepts. The set of reference concepts is representative of documents in the corpus for a document review project in which data organization or classification is desired. The reference concept set can be previously defined and maintained for related concept review projects or can be specifically generated for each review project. A predefined reference set provides knowledge previously obtained during the related concept review project to increase efficiency, accuracy, and consistency. Reference sets newly generated for each review project can include arbitrary or customized reference sets that are determined by a reviewer or a machine.

The set of reference concepts can be generated during guided review, which assists a reviewer in building a reference concept set. During guided review, the uncoded concepts that are dissimilar to the other uncoded concepts are identified based on a similarity threshold. Other methods for determining dissimilarity are possible. Identifying a set of dissimilar concepts provides a group of uncoded concepts that is representative of the corpus for the document review project. Each identified dissimilar concept is then classified by assigning a particular classification code based on the content of the concept to collectively generate a set of reference concepts. Guided review can be performed by a reviewer, a machine, or a combination of the reviewer and machine.
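The threshold-based identification of dissimilar concepts during guided review might be sketched as follows. The greedy pass, the function names `jaccard` and `select_dissimilar`, and the use of Jaccard overlap between the sets of documents containing each concept are all assumptions for illustration; the description does not specify the similarity measure or the selection order.

```python
def jaccard(docs_a, docs_b):
    """Overlap between two sets of document ids, from 0.0 to 1.0."""
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def select_dissimilar(concept_docs, threshold):
    """Greedily keep concepts that fall below the similarity threshold
    against every concept already kept (an assumed selection strategy)."""
    selected = []
    for name, docs in concept_docs.items():
        if all(jaccard(docs, concept_docs[kept]) < threshold for kept in selected):
            selected.append(name)
    return selected


# Hypothetical concepts mapped to the ids of documents that contain them.
concept_docs = {
    "mold": {1, 2, 3},
    "mildew": {1, 2, 3, 4},   # overlaps heavily with "mold"
    "pricing": {7, 8},        # shares no documents with the others
}
candidates = select_dissimilar(concept_docs, threshold=0.5)
```

Each selected candidate would then be assigned a classification code by a reviewer, a machine, or both, to build the reference concept set.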

Other methods for generating a reference concept set for a document review project using guided review are possible, including clustering. For example, a set of uncoded concepts to be classified is clustered, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. A plurality of the clustered uncoded concepts are selected based on selection criteria, such as cluster centers or sample clusters. The cluster centers can be used to identify uncoded concepts in a cluster that are most similar or dissimilar to the cluster center. The identified uncoded concepts are then selected for classification by assigning classification codes. After classification, the concepts represent a reference set. In a further embodiment, sample clusters can be used to generate a reference concept set by selecting one or more sample clusters based on cluster relation criteria, such as size, content, similarity, or dissimilarity. The uncoded concepts in the selected sample clusters are then assigned classification codes. The classified concepts represent a concept reference set for the document review project. Other methods for selecting concepts for use as a reference set are possible.

Once generated, a subset of reference concepts is selected from the reference concept set for clustering with uncoded concepts. FIG. 3 is a block diagram showing, by way of example, measures 60 for selecting reference concept subsets 61 for use in the method of FIG. 2. A reference concept subset 61 includes one or more reference concepts selected from a set of reference concepts associated with a document review project for use in clustering with uncoded concepts. The reference concept subset can be predefined 62, customized 64, selected arbitrarily 63, or based on similarity 65.

A subset of predefined reference concepts 62 can be selected from a reference set, which is associated with another document review project that is related to the current document review project. An arbitrary reference subset 63 includes reference concepts randomly selected from a reference set, which can be predefined or newly generated for the current document review project or a related document review project. A customized reference subset 64 includes reference concepts specifically selected from a current or related reference set based on criteria, such as reviewer preference, classification category, document source, content, and review project. Other criteria are possible. The number of reference concepts in a subset can be determined automatically or by a reviewer based on reference factors, such as a size of the document review project, an average size of the assignments, types of classification codes, and a number of reference concepts associated with each classification code. Other reference factors are possible. In a further embodiment, the reference concept subset can include more than one occurrence of a reference concept. Other types of reference concept subsets and methods for selecting the reference concept subsets are possible.
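The arbitrary and customized subsets described above can be sketched in a few lines. The record layout (dictionaries with `name`, `code`, and `source` keys) and the function names are hypothetical; the description does not prescribe how reference concepts are stored.

```python
import random


def arbitrary_subset(reference_set, size, seed=None):
    """Randomly select up to `size` reference concepts from the set."""
    rng = random.Random(seed)
    return rng.sample(reference_set, min(size, len(reference_set)))


def customized_subset(reference_set, **criteria):
    """Select the reference concepts matching every given criterion,
    for example a classification category or a document source."""
    return [
        concept for concept in reference_set
        if all(concept.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical reference set for illustration.
reference_set = [
    {"name": "attorney advice", "code": "privileged", "source": "email"},
    {"name": "mold", "code": "responsive", "source": "report"},
    {"name": "lunch menu", "code": "non-responsive", "source": "email"},
    {"name": "legal opinion", "code": "privileged", "source": "report"},
]
random_pick = arbitrary_subset(reference_set, 2, seed=0)
privileged_only = customized_subset(reference_set, code="privileged")
```

The subset size passed to `arbitrary_subset` corresponds to the number of reference concepts that, per the description, can be set automatically or by a reviewer.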

Forming Clusters

Once identified, the reference concept subset can be used for clustering with uncoded concepts representative of a corpus for a particular document review project. The corpus of uncoded concepts for a review project can be divided into assignments using assignment criteria, such as custodian or source of the uncoded concept, content, document type, and date. Other criteria are possible. In one embodiment, each assignment is assigned to an individual reviewer for analysis. The assignments can be separately clustered with the reference concept subset or alternatively, all of the uncoded concepts in the corpus can be clustered with the reference concept subset. The assignments can be separately analyzed or alternatively, analyzed together to determine concepts for the one or more document assignments. The content of each document within the corpus can be converted into a set of concepts. As described above, concepts typically include nouns and noun phrases obtained through part-of-speech tagging that have a common semantic meaning. The concepts, which are representative of the documents, can be clustered to provide an intuitive grouping of the document content.

Clustering of the uncoded concepts provides groupings of related uncoded concepts and is based on a similarity metric using score vectors assigned to each uncoded concept. The score vectors can be generated using a matrix showing the uncoded concepts in relation to documents that contain the concepts. FIG. 4 is a table showing, by way of example, a matrix mapping 70 of uncoded concepts 74 and documents 73. The uncoded documents 73 are listed along a horizontal dimension 71 of the matrix, while the concepts 74 are listed along a vertical dimension 72. However, the placement of the uncoded documents 73 and concepts 74 can be reversed. Each cell 75 within the matrix 70 includes a cumulative number of occurrences of each concept within a particular uncoded document 73. Score vectors can be generated for each document by identifying the concepts and associated weights within that document and ordering the concepts along a vector with the associated concept weight. In the matrix 70, the score vector 76 for a document 73 can be identified as all the concepts included in that document and the associated weights, which are based on the number of occurrences of each concept. Score vectors can also be generated for each concept by identifying the documents that contain that concept and determining a weight associated with each document. The documents and associated weights are then ordered along a vector for each concept, as the concept score vector. In the matrix 70, the score vector 77 for a concept can be identified as all the documents that contain that concept and the associated weights.
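The matrix mapping and the two kinds of score vector can be illustrated with a small sketch. The function names and the sample documents are hypothetical, and the raw occurrence count is used directly as the weight, which is a simplifying assumption; the description says only that weights are based on the number of occurrences.

```python
from collections import Counter, defaultdict


def build_matrix(doc_concepts):
    """Map each concept to {document id: occurrence count}.

    `doc_concepts` maps a document id to the list of concept occurrences
    extracted from that document.
    """
    matrix = defaultdict(dict)
    for doc_id, concepts in doc_concepts.items():
        for concept, count in Counter(concepts).items():
            matrix[concept][doc_id] = count
    return dict(matrix)


def concept_score_vector(matrix, concept):
    """Documents containing the concept, paired with weights, ordered by id."""
    return sorted(matrix.get(concept, {}).items())


def document_score_vector(matrix, doc_id):
    """Concepts found in the document, paired with weights, ordered by name."""
    return sorted((c, row[doc_id]) for c, row in matrix.items() if doc_id in row)


# Hypothetical extraction results for three documents.
doc_concepts = {
    1: ["mold", "mold", "wood composite"],
    2: ["mold", "warranty"],
    3: ["warranty"],
}
matrix = build_matrix(doc_concepts)
mold_vector = concept_score_vector(matrix, "mold")
doc1_vector = document_score_vector(matrix, 1)
```

Reading the structure by concept gives the concept score vector, and reading it by document gives the document score vector, mirroring the two dimensions of the matrix.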

Clustering provides groupings of related uncoded concepts and reference concepts. FIG. 5 is a flow diagram showing a routine 80 for forming clusters for use in the method 50 of FIG. 2. The purpose of this routine is to use score vectors associated with the concepts, including uncoded and reference concepts, to form clusters based on relative similarity. Hereinafter, the term “concept” is intended to include uncoded concepts and reference concepts selected for clustering, unless otherwise indicated. The score vector associated with each concept includes a set of paired values of documents and associated weights, which are based on scores. The score vector is generated by scoring the documents, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference.

As an initial step for generating score vectors, each document within a concept is individually scored. Next, a normalized score vector is created for the concept by identifying paired values, consisting of a document represented by that concept and the scores for that document. The paired values are ordered along a vector to generate the score vector. The paired values can be ordered based on the documents, as well as other factors. For example, assume a normalized score vector for a first Concept A is $\vec{S}_{A}$ = {(5, 0.5), (120, 0.75)} and a normalized score vector for another Concept B is $\vec{S}_{B}$ = {(3, 0.4), (5, 0.75), (47, 0.15)}. Concept A has scores corresponding to tokens ‘5’ and ‘120’ and Concept B has scores corresponding to tokens ‘3,’ ‘5’ and ‘47.’ Thus, these concepts only have token ‘5’ in common. Once generated, the score vectors can be compared to determine similarity or dissimilarity between the corresponding concepts during clustering.

The routine for forming clusters of concepts, including uncoded concepts and reference concepts, proceeds in two phases. During the first phase (blocks 83-88), the concepts are evaluated to identify a set of seed concepts, which can be used to form new clusters. During the second phase (blocks 90-96), any concepts not previously placed are evaluated and grouped into the existing clusters based on a best-fit criterion.

Initially, a single cluster is generated with one or more concepts as seed concepts and additional clusters of concepts are added, if necessary. Each cluster is represented by a cluster center that is associated with a score vector, which is representative of all the documents associated with concepts in that cluster. The cluster center score vector can be generated by comparing the score vectors for the individual concepts in the cluster and identifying common documents shared by the concepts. The most common documents and associated weights are ordered along the cluster center score vector. Cluster centers, and thus cluster center score vectors, may continually change due to the addition and removal of concepts during clustering.

During clustering, the concepts are identified (block 81) and ordered by length (block 82). The concepts can include all reference concepts in a subset and one or more assignments of uncoded concepts. Each concept is then processed in an iterative processing loop (blocks 83-88) as follows. The similarity between each concept and a center of each cluster is determined (block 84) as the cosine (cos) σ of the score vectors for the concept and cluster being compared. The cos σ provides a measure of relative similarity or dissimilarity between the concepts associated with the documents and is equivalent to the inner product between the normalized score vectors for the concept and cluster center.

In the described embodiment, the cos σ is calculated in accordance with the equation:

$\cos \sigma_{AB} = \frac{\langle \vec{S}_{A} \cdot \vec{S}_{B} \rangle}{|\vec{S}_{A}|\,|\vec{S}_{B}|}$

where cos σ_(AB) comprises the similarity metric between Concept A and cluster center B, $\vec{S}_{A}$ comprises a score vector for the Concept A, and $\vec{S}_{B}$ comprises a score vector for the cluster center B. Other forms of determining similarity using a distance metric are feasible, as would be recognized by one skilled in the art. An example includes using Euclidean distance.
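By way of illustration only, the similarity metric can be sketched for sparse score vectors as follows. The representation of a score vector as a dictionary keyed by token, and the name `cosine_similarity`, are assumptions made for this sketch; the sample values are those given above for Concept A and Concept B.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse score vectors (dict: token -> score)."""
    # Only tokens present in both vectors contribute to the inner product.
    dot = sum(weight * b[token] for token, weight in a.items() if token in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


# Score vectors for Concept A and Concept B from the example in the text.
score_a = {5: 0.5, 120: 0.75}
score_b = {3: 0.4, 5: 0.75, 47: 0.15}

similarity = cosine_similarity(score_a, score_b)
```

Because the two vectors share only token ‘5’, the inner product reduces to the single term 0.5 × 0.75, and the resulting similarity is roughly 0.48.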

Only those concepts that are sufficiently distinct from all cluster centers (block 85) are selected as seed concepts for forming new clusters (block 86). If the concept being compared is not sufficiently distinct (block 85), the concept is then grouped into a cluster with the most similar cluster center (block 87). Processing continues with the next concept (block 88).

In the second phase, each concept not previously placed is processed in an iterative processing loop (blocks 90-96) as follows. Again, the similarity between each remaining concept and each of the cluster centers is determined based on a distance (block 91), such as the cos σ of the normalized score vectors for each of the remaining concepts and the cluster centers. A best fit between a remaining concept and a cluster center can be found subject to a minimum fit criterion (block 92). In the described embodiment, a minimum fit criterion of 0.25 is used, although other minimum fit criteria could be used. If a best fit is found (block 93), the remaining concept is grouped into the cluster having the best fit (block 95). Otherwise, the remaining concept is grouped into a miscellaneous cluster (block 94). Processing continues with the next remaining concept (block 96). Finally, a dynamic threshold can be applied to each cluster (block 97) to evaluate and strengthen concept membership in a particular cluster. The dynamic threshold is applied on a cluster-by-cluster basis, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. The routine then returns. Other methods and processes for forming clusters are possible.
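By way of illustration only, the two-phase routine can be sketched as follows. This is a simplified sketch under several assumptions that are not part of the described embodiment: the seed threshold of 0.1 is hypothetical, the cluster center is approximated as the sum of the member score vectors, concepts are ordered longest first, every non-seed concept is deferred to the second phase, and the dynamic threshold of block 97 is omitted. Only the minimum fit criterion of 0.25 is taken from the text.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse score vectors (dict: document -> weight)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def center_of(vectors):
    """Simplifying assumption: the cluster center is the sum of the member vectors."""
    center = {}
    for vec in vectors:
        for key, weight in vec.items():
            center[key] = center.get(key, 0.0) + weight
    return center


def form_clusters(concepts, seed_threshold=0.1, min_fit=0.25):
    """concepts: dict mapping a concept name to its score vector."""
    # Order by length (number of paired values), longest first.
    ordered = sorted(concepts, key=lambda n: len(concepts[n]), reverse=True)
    clusters, unplaced, misc = [], [], []

    # First phase: concepts sufficiently distinct from every center become seeds.
    for name in ordered:
        centers = [center_of([concepts[m] for m in c]) for c in clusters]
        if all(cosine(concepts[name], c) < seed_threshold for c in centers):
            clusters.append([name])
        else:
            unplaced.append(name)

    # Second phase: place each remaining concept by best fit, subject to min_fit.
    for name in unplaced:
        sims = [cosine(concepts[name], center_of([concepts[m] for m in c]))
                for c in clusters]
        best = max(range(len(clusters)), key=lambda i: sims[i])
        if sims[best] >= min_fit:
            clusters[best].append(name)
        else:
            misc.append(name)  # miscellaneous cluster
    return clusters, misc


# Hypothetical score vectors for four concepts.
example = {
    "A": {1: 1.0, 2: 1.0},
    "B": {3: 1.0, 4: 1.0},
    "C": {1: 1.0, 2: 0.5},
    "D": {4: 0.2, 9: 1.0},
}
clusters, misc = form_clusters(example)
```

In this example, concepts A and B become seeds, C joins the cluster seeded by A, and D falls below the minimum fit criterion for every cluster and is placed in the miscellaneous cluster.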

Alternatively, clusters can be generated by injection as further described in commonly-owned U.S. Pat. No. 8,635,223, issued Jan. 21, 2014, the disclosure of which is incorporated by reference.

Once clustered, similar concepts can be identified as described in commonly-assigned U.S. Pat. No. 8,645,378, issued Feb. 4, 2014, the disclosure of which is incorporated by reference.

Displaying the Reference Concepts

Once formed, the clusters of concepts can be organized to generate spines of thematically related clusters, as described in commonly-assigned U.S. Pat. No. 7,271,804, the disclosure of which is incorporated by reference. Each spine includes those clusters that share one or more concepts, which are placed along a vector. Also, the cluster spines can be positioned in relation to other cluster spines based on a theme shared by those cluster spines, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. Each theme can include one or more concepts defining a semantic meaning. Organizing the clusters into spines and groups of cluster spines provides an individual reviewer with a display that presents the concepts according to a theme while maximizing the number of relationships depicted between the concepts.

FIG. 6 is a screenshot 100 showing, by way of example, a visual display 101 of reference concepts 105 in relation to uncoded concepts 104. Clusters 103 can be located along a spine, which is a straight vector, based on a similarity of the concepts 104, 105 in the clusters 103. Each cluster 103 is represented by a circle; however, other shapes, such as squares, rectangles, and triangles are possible, as described in U.S. Pat. No. 6,888,548, the disclosure of which is incorporated by reference. The uncoded concepts 104 are each represented by a smaller circle within the clusters 103, while the reference concepts 105 are each represented by a circle having a diamond shape within the boundaries of the circle. The reference concepts 105 can be further represented by their assigned classification code. The classification codes can include “privileged,” “responsive,” and “non-responsive” codes, as well as other codes. Each group of reference concepts associated with a particular classification code can be identified by a different color. For instance, “privileged” reference concepts can be colored blue, while “non-responsive” reference concepts are red and “responsive” reference concepts are green. In a further embodiment, the reference concepts for different classification codes can include different symbols. For example, “privileged” reference concepts can be represented by a circle with an “X” in the center, while “non-responsive” reference concepts can include a circle with striped lines and “responsive” reference concepts can include a circle with dashed lines. Other classification representations for the reference concepts are possible. Each cluster spine 106 is represented as a straight vector along which the clusters are placed.

The display 101 can be manipulated by an individual reviewer via a compass 102, which enables the reviewer to navigate, explore, and search the clusters 103 and spines 106 appearing within the compass 102, as further described in commonly-assigned U.S. Pat. No. 7,356,777, the disclosure of which is incorporated by reference. Visually, the compass 102 emphasizes clusters 103 located within the compass 102, while deemphasizing clusters 103 appearing outside of the compass 102.

Spine labels 109 appear outside of the compass 102 at an end of each cluster spine 106 to connect the outermost cluster of a cluster spine 106 to the closest point along the periphery of the compass 102. In one embodiment, the spine labels 109 are placed without overlap and circumferentially around the compass 102. Each spine label 109 corresponds to one or more documents represented by the clustered concepts that most closely describe the cluster spines 106. Additionally, the documents associated with each of the spine labels 109 can appear in a documents list (not shown) also provided in the display. Additionally, the cluster concepts for each of the spine labels 109 can appear in a documents list (not shown) also provided in the display. Toolbar buttons 107 located at the top of the display 101 enable a user to execute specific commands for the composition of the spine groups displayed. A set of pull down menus 108 provide further control over the placement and manipulation of clusters 103 and cluster spines 106 within the display 101. Other types of controls and functions are possible.

A concept guide 110 can be placed within the display 101. The concept guide 110 can include a “Selected” field, a “Search Results” field, and details regarding the numbers of uncoded concepts and reference concepts provided in the display. The number of uncoded concepts includes all uncoded concepts selected for clustering, such as within a corpus of uncoded concepts for a review project or within an assignment. The number of reference concepts includes the reference concept subset selected for clustering. The “Selected” field in the concept guide 110 provides a number of concepts within one or more clusters selected by the reviewer. The reviewer can select a cluster by “double clicking” the visual representation of that cluster using a mouse. The “Search Results” field provides a number of uncoded concepts and reference concepts that include a particular search term identified by the reviewer in a search query box 112.

In one embodiment, a garbage can 111 is provided to remove documents from consideration in the current set of clusters 103. Removing documents prevents those documents from affecting future clustering, as may occur when a reviewer considers a document irrelevant to the clusters 103.

The display 101 provides a visual representation of the relationships between thematically-related concepts, including the uncoded concepts and reference concepts. The uncoded concepts and reference concepts located within a cluster or spine can be compared based on characteristics, such as the assigned classification codes of the reference concepts, a number of reference concepts associated with each classification code, and a number of different classification codes to identify relationships between the uncoded concepts and reference concepts. The reviewer can use the displayed relationships as suggestions for classifying the uncoded concepts. For example, FIG. 7A is a block diagram showing, by way of example, a cluster 120 with “privileged” reference concepts 122 and uncoded concepts 121. The cluster 120 includes nine uncoded concepts 121 and three reference concepts 122. Each reference concept 122 is classified as “privileged.” Accordingly, based on the number of “privileged” reference concepts 122 present in the cluster 120, the absence of other classifications of reference concepts, and the thematic relationship between the uncoded concepts 121 and the “privileged” reference concepts 122, the reviewer may be more inclined to review the uncoded concepts 121 in that cluster 120 or to classify one or more of the uncoded concepts 121 as “privileged” without review.

Alternatively, the three reference concepts can be classified as “non-responsive,” instead of “privileged” as in the previous example. FIG. 7B is a block diagram showing, by way of example, a cluster 123 with “non-responsive” reference concepts 124 and uncoded concepts 121. The cluster 123 includes nine uncoded concepts 121 and three “non-responsive” concepts 124. Since the uncoded concepts 121 in the cluster are thematically related to the “non-responsive” reference concepts 124, the reviewer may wish to assign a “non-responsive” code to one or more of the uncoded concepts 121 without review, as they are most likely not relevant to the legal matter associated with the document review project. In making a decision to assign a code, such as “non-responsive,” the reviewer can consider the number of “non-responsive” reference concepts in the cluster, the presence or absence of other reference concept classification codes, and the thematic relationship between the “non-responsive” reference concepts and the uncoded concepts. Thus, the presence of the three “non-responsive” reference concepts 124 in the cluster provides a suggestion that the uncoded concepts 121 may also be “non-responsive.” Further, the label 109 associated with the spine 106 upon which the cluster is located can also be used to influence a suggestion.

A further example can include a cluster with a combination of “privileged” and “non-responsive” reference concepts. For example, FIG. 7C is a block diagram showing, by way of example, a cluster 125 with uncoded concepts 121 and a combination of differently classified reference concepts 122, 124. The cluster 125 can include one “privileged” reference concept 122, two “non-responsive” reference concepts 124, and nine uncoded concepts 121. The “privileged” 122 and “non-responsive” 124 reference concepts can be distinguished by different colors or shapes, as well as other identifiers. The combination of “privileged” 122 and “non-responsive” 124 reference concepts within the cluster 125 can suggest to a reviewer that the uncoded concepts 121 should be reviewed before classification or that one or more uncoded concepts 121 should be classified as “non-responsive” based on the higher number of “non-responsive” reference concepts 124 in the cluster 125. In making a classification decision, the reviewer may consider the number of “privileged” reference concepts 122 versus the number of “non-responsive” reference concepts 124, as well as the thematic relationships between the uncoded concepts 121 and the “privileged” 122 and “non-responsive” 124 reference concepts. Additionally, the reviewer can identify the closest reference concept to an uncoded concept and assign the classification code of the closest reference concept to the uncoded concept. Other examples, classification codes, and combinations of classification codes are possible.

Additionally, the reference concepts can also provide suggestions for classifying clusters and spines. The suggestions provided for classifying a cluster can include factors, such as a presence or absence of classified concepts with different classification codes within the cluster and a quantity of the classified concepts associated with each classification code in the cluster. The classification code assigned to the cluster is representative of the concepts in that cluster and can be the same as or different from one or more classified concepts within the cluster. Further, the suggestions provided for classifying a spine include factors, such as a presence or absence of classified concepts with different classification codes within the clusters located along the spine and a quantity of the classified concepts for each classification code. Other suggestions for classifying concepts, clusters, and spines are possible.
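By way of illustration only, the presence, absence, and quantity factors for a single cluster can be tallied as sketched below. The function name `cluster_suggestion`, the returned field names, and the choice to withhold a suggestion when the leading codes are tied are assumptions made for this sketch.

```python
from collections import Counter


def cluster_suggestion(reference_codes):
    """Summarize the classification codes of the reference concepts in one cluster."""
    counts = Counter(reference_codes)
    if not counts:
        # No reference concepts in the cluster, so no suggestion can be drawn.
        return {"counts": {}, "suggested_code": None, "mixed": False}
    ranked = counts.most_common()
    top_code, top_count = ranked[0]
    # Assumption: a tie between the leading codes yields no suggestion.
    tie = len(ranked) > 1 and ranked[1][1] == top_count
    return {
        "counts": dict(counts),                        # quantity per classification code
        "suggested_code": None if tie else top_code,   # most frequent code, if unique
        "mixed": len(counts) > 1,                      # more than one code present
    }


# Reference concepts of the clusters of FIG. 7A and FIG. 7C, respectively.
fig_7a = cluster_suggestion(["privileged", "privileged", "privileged"])
fig_7c = cluster_suggestion(["privileged", "non-responsive", "non-responsive"])
```

For the cluster of FIG. 7C, the tally reports a mixed cluster in which “non-responsive” is the more numerous code, which a reviewer can weigh against the option of reviewing the uncoded concepts first.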

Classifying Uncoded Concepts

The display of relationships between the uncoded concepts and reference concepts can provide suggestions to an individual reviewer. The suggestions can indicate a need for manual review of the uncoded concepts, when review may be unnecessary, and hints for classifying the uncoded concepts. Additional information can be generated to assist the reviewer in making classification decisions for the uncoded concepts, such as a machine-generated confidence level associated with a suggested classification code, as described in commonly-assigned U.S. Pat. No. 8,515,958, issued on Aug. 20, 2013, the disclosure of which is incorporated by reference.

The machine-generated suggestion for classification and associated confidence level can be determined by a classifier. FIG. 8 is a process flow diagram 130 showing, by way of example, a method for classifying uncoded concepts by a classifier for use in the method of FIG. 2. An uncoded concept is selected from a cluster within a cluster set (block 131) and compared to a neighborhood of x-reference concepts (block 132), also located within the cluster, to identify those reference concepts that are most relevant to the selected uncoded concept. In a further embodiment, a machine-generated suggestion for classification and an associated confidence level can be provided for a cluster or spine by selecting and comparing the cluster or spine to a neighborhood of x-reference concepts determined for the selected cluster or spine.

The neighborhood of x-reference concepts is determined separately for each selected uncoded concept and can include one or more reference concepts within that cluster. During neighborhood generation, an x-number of reference concepts is first determined automatically or by an individual reviewer. Next, the x-number of reference concepts nearest in distance to the selected uncoded concept are identified. Finally, the identified x-number of reference concepts are provided as the neighborhood for the selected uncoded concept. In a further embodiment, the x-number of reference concepts are defined for each classification code, rather than across all classification codes. Once generated, the x-number of reference concepts in the neighborhood and the selected uncoded concept are analyzed by the classifier to provide a machine-generated classification suggestion (block 133). A confidence level for the suggested classification is also provided (block 134).
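By way of illustration only, neighborhood generation can be sketched as follows, assuming that nearness is measured by cosine similarity between score vectors and that each reference concept is held as a (name, classification code, score vector) tuple. The function names and the sample data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse score vectors (dict: document -> weight)."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def neighborhood(uncoded_vector, references, x):
    """Return the x reference concepts nearest to the selected uncoded concept.

    references: list of (name, classification_code, score_vector) tuples
    drawn from the same cluster as the uncoded concept.
    """
    ranked = sorted(
        references,
        key=lambda ref: cosine(uncoded_vector, ref[2]),
        reverse=True,  # highest similarity (smallest distance) first
    )
    return ranked[:x]


# Hypothetical data for one cluster.
uncoded = {1: 1.0, 2: 1.0}
references = [
    ("r1", "privileged", {1: 1.0, 2: 0.9}),
    ("r2", "non-responsive", {3: 1.0}),
    ("r3", "responsive", {1: 0.2, 3: 1.0}),
]
nearest = neighborhood(uncoded, references, x=2)
```

The further embodiment, in which the x-number is defined per classification code, could be obtained by applying the same function separately to the reference concepts of each code.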

The analysis of the selected uncoded concept and x-number of reference concepts can be based on one or more routines performed by the classifier, such as a nearest neighbor (NN) classifier. The routines for determining a suggested classification code include a minimum distance classification measure, also known as closest neighbor, minimum average distance classification measure, maximum count classification measure, and distance weighted maximum count classification measure. The minimum distance classification measure includes identifying a neighbor that is the closest distance to the selected uncoded concept and assigning the classification code of the closest neighbor as the suggested classification code for the selected uncoded concept. The closest neighbor is determined by comparing the score vectors for the selected uncoded concept with each of the x-number of reference concepts in the neighborhood as the cos σ to determine a distance metric. The distance metrics for the x-number of reference concepts are compared to identify the reference concept closest to the selected uncoded concept as the closest neighbor.

The minimum average distance classification measure includes calculating an average distance of the reference concepts in a cluster for each classification code. The classification code with the reference concepts having the closest average distance to the selected uncoded concept is assigned as the suggested classification code. The maximum count classification measure, also known as the voting classification measure, includes counting a number of reference concepts within the cluster for each classification code and assigning a count or “vote” to the reference concepts based on the assigned classification code. The classification code with the highest number of reference concepts or “votes” is assigned to the selected uncoded concept as the suggested classification. The distance weighted maximum count classification measure includes identifying a count of all reference concepts within the cluster for each classification code and determining a distance between the selected uncoded concept and each of the reference concepts. Each count assigned to the reference concepts is weighted based on the distance of the reference concept from the selected uncoded concept. The classification code with the highest count, after consideration of the weight, is assigned to the selected uncoded concept as the suggested classification.
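By way of illustration only, the four classification measures can be sketched as follows. The sketch assumes that each neighbor is given as a (classification code, distance) pair, where the distance has already been computed, for instance as one minus the cos σ. The inverse-distance weighting used in the last measure is an assumption, since the text does not specify the weighting function; all function names are hypothetical.

```python
from collections import defaultdict


def minimum_distance(neighbors):
    """Closest neighbor: the code of the single nearest reference concept."""
    return min(neighbors, key=lambda n: n[1])[0]


def minimum_average_distance(neighbors):
    """The code whose reference concepts have the smallest average distance."""
    by_code = defaultdict(list)
    for code, distance in neighbors:
        by_code[code].append(distance)
    return min(by_code, key=lambda c: sum(by_code[c]) / len(by_code[c]))


def maximum_count(neighbors):
    """Voting: the code held by the largest number of reference concepts."""
    votes = defaultdict(int)
    for code, _ in neighbors:
        votes[code] += 1
    return max(votes, key=votes.get)


def distance_weighted_maximum_count(neighbors, eps=1e-9):
    """Voting in which each vote is weighted by inverse distance (an assumption)."""
    votes = defaultdict(float)
    for code, distance in neighbors:
        votes[code] += 1.0 / (distance + eps)  # eps guards against zero distance
    return max(votes, key=votes.get)


# Hypothetical neighborhood: one near "privileged" and two farther "non-responsive".
neighbors = [("privileged", 0.1), ("non-responsive", 0.4), ("non-responsive", 0.5)]
```

With this neighborhood, the maximum count measure suggests “non-responsive,” while the other three measures suggest “privileged,” which shows how the choice of measure can change the suggestion when a minority code lies much closer to the selected uncoded concept.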

The machine-generated classification code is provided for the selected uncoded concept with a confidence level, which can be presented as an absolute value or a percentage. Other confidence level measures are possible. The reviewer can use the suggested classification code and confidence level to assign a classification to the selected uncoded concept. Alternatively, the x-NN classifier can automatically assign the suggested classification. In one embodiment, the x-NN classifier only assigns an uncoded concept with the suggested classification code if the confidence level is above a threshold value, which can be set by the reviewer or the x-NN classifier.
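By way of illustration only, threshold-gated automatic assignment can be sketched as follows. The confidence level is assumed here to be the share of neighborhood votes held by the suggested code, which is only one of many possible confidence measures; the default threshold of 0.6 and the function names are hypothetical, and the neighborhood is assumed to be non-empty.

```python
from collections import Counter


def suggest_with_confidence(neighbor_codes):
    """Suggested code and a confidence level expressed as a vote share in [0, 1]."""
    counts = Counter(neighbor_codes)
    code, votes = counts.most_common(1)[0]
    return code, votes / len(neighbor_codes)


def auto_assign(neighbor_codes, threshold=0.6):
    """Assign the suggested code only if the confidence level is above the threshold."""
    code, confidence = suggest_with_confidence(neighbor_codes)
    return code if confidence > threshold else None  # None leaves the concept uncoded
```

For example, a neighborhood of two “privileged” reference concepts and one “non-responsive” reference concept yields a confidence of about 0.67 and is assigned automatically at the default threshold, whereas an evenly split neighborhood is left for the reviewer.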

Classification can also occur on a cluster or spine level. For instance, for cluster classification, a cluster is selected and a score vector for the center of the cluster is determined as described above with reference to FIG. 5. A neighborhood for the selected cluster is determined based on a distance metric. The x-number of reference concepts that are closest to the cluster center can be selected for inclusion in the neighborhood, as described above. Each reference concept in the selected cluster is associated with a score vector and the distance is determined by comparing the score vector of the cluster center with the score vector of each reference concept to determine an x-number of reference concepts that are closest to the cluster center. However, other methods for generating a neighborhood are possible. Once determined, one of the classification measures is applied to the neighborhood to determine a suggested classification code and confidence level for the selected cluster.

During classification, either by an individual reviewer or a machine, the reviewer can retain control over many aspects, such as a source of the reference concepts and a number of reference concepts to be selected. FIG. 9 is a screenshot 140 showing, by way of example, an options dialogue box 141 for entering user preferences for clustering and display of the uncoded concepts and reference concepts. The dialogue box 141 can be accessed via a pull-down menu as described above with respect to FIG. 6. Within the dialogue box 141, the reviewer can utilize user-selectable parameters to define a reference source 142, category filter 143, command details 144, advanced options 145, classifier parameters 146, and commands 147. Each user-selectable option can include a text box for entry of a user preference or a drop-down menu with predetermined options for selection by the reviewer. Other user-selectable options and displays are possible.

The reference source parameter 142 allows the reviewer to identify one or more sources of the reference concepts. The sources can include all reference concepts for which the associated classification has been verified, all reference concepts that have been analyzed, and all reference concepts in a particular binder. The binder can include reference concepts particular to a current document review project or that are related to a prior document review project. The category filter parameter 143 allows the reviewer to generate and display the subset of reference concepts using only those reference concepts associated with a particular classification code. Other options for generating the reference set are possible, including custodian, source, and content. The command parameters 144 allow the reviewer to enter instructions regarding actions for the uncoded and reference concepts, such as indicating counts of the concepts, and display of the concepts. The advanced option parameters 145 allow the reviewer to specify clustering thresholds and classifier parameters. The parameters entered by the user can be compiled as command parameters 146 and provided in a drop-down menu on a display of the clusters. Other user selectable parameters, options, and actions are possible.

In a further embodiment, once the uncoded concepts are assigned a classification code, the newly-classified uncoded concepts can be placed into the concept reference set for use in providing classification suggestions for other uncoded concepts.

In yet a further embodiment, each document can be represented by more than one concept. Accordingly, to determine a classification code for the document, the classification codes for each of the associated concepts can be analyzed and compared for consideration in classifying the document. In one example, a classification code can be determined by counting the number of associated concepts for each classification code and then assigning the classification code with the most associated concepts. In a further example, one or more of the associated concepts can be weighted and the classification code associated with the highest weight of concepts is assigned. Other methods for determining a classification code for uncoded documents based on reference concepts are possible.
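By way of illustration only, the counting and weighting examples for classifying a document from its associated concepts can be sketched together as follows. The function name `classify_document`, the default weight of 1.0 for concepts without an explicit weight, and the sample concept names are assumptions made for this sketch.

```python
from collections import defaultdict


def classify_document(concept_codes, weights=None):
    """Pick a classification code for a document from the codes of its concepts.

    concept_codes: dict mapping each associated concept to its classification code.
    weights: optional dict mapping a concept to a weight; unweighted concepts
             count as 1.0, so omitting weights reduces to simple counting.
    """
    totals = defaultdict(float)
    for concept, code in concept_codes.items():
        totals[code] += 1.0 if weights is None else weights.get(concept, 1.0)
    return max(totals, key=totals.get)


# Hypothetical document represented by three concepts.
concept_codes = {"c1": "responsive", "c2": "responsive", "c3": "privileged"}
by_count = classify_document(concept_codes)
by_weight = classify_document(concept_codes, weights={"c3": 5.0})
```

Simple counting favors “responsive” here, while giving concept c3 a weight of 5.0 shifts the result to “privileged.”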

Although clustering and displaying relationships has been described above with reference to concepts, other tokens, such as word-level or character-level n-grams, raw terms, and entities, are possible.

While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope.

What is claimed is:
 1. A computer-implemented system for providing visual classification suggestions for inclusion-based concept clusters, comprising the steps of: a computer processor configured to execute modules, comprising: a designation module configured to designate reference concepts each associated with a classification code; a grouping module configured to group one or more of the reference concepts with a plurality of uncoded concepts into a grouped concept set; a generation module configured to generate clusters, each comprising a portion of the uncoded concepts and the reference concepts of the grouped concept set; and a suggestion module configured to provide a visual suggestion for assigning one of the classification codes to one of the clusters comprising visually representing each of the reference concepts in that cluster.
 2. A system according to claim 1, further comprising: a presence module configured to provide at least one of a presence and an absence of the concepts with each of the classification codes in that cluster; and a number module configured to provide a number of the concepts with each of the classification codes in that cluster, wherein the suggestion includes at least one of the number and the presence and the absence.
 3. A system according to claim 2, further comprising: a concept module configured to provide a suggestion for classifying one of the uncoded concepts in that cluster comprising at least one of the number and the presence and the absence.
 4. A system according to claim 1, further comprising: a selection module configured to receive a user selection of one or more sources, custodians, content, and the classification codes of the reference concepts, wherein the reference concepts are designated in accordance with the selection.
 5. A system according to claim 1, further comprising: a parameter module configured to receive a user selection of one or more parameters for generating the clusters, wherein the clustering is performed in accordance with the selection.
 6. A system according to claim 1, further comprising: a spine module configured to determine a similarity between the clusters and organizing the clusters along one or more spines based on the similarity, each of the spines comprising a vector; and a spine classification module configured to provide a visual classification suggestion for assigning one of the classification codes to one of the spines comprising visually representing the clusters along that spine and the reference concepts in the clusters along that spine.
 7. A system according to claim 6, further comprising: a user selection module configured to receive a user selection of one or more of the clusters and spines; a display module to display one or more of the clusters and spines within a compass; an emphasis module configured to emphasize those of the clusters and spines displayed within the compass; and a deemphasizing module configured to deemphasize those of the clusters and spines displayed outside of the compass.
 8. A system according to claim 6, further comprising: a label module configured to associate a label with each of the spines, each label associated with one or more documents from which one or more of the concepts in the clusters along that spine were extracted; and a list module configured to display a list of the documents associated with one of the spine labels.
 9. A system according to claim 1, wherein the visual representation of the each classification codes comprises at least one of a symbol, shape, and color different from the visual representations of the remaining classification codes.
 10. A system according to claim 1, further comprising: an identification module configured to identify one of the concepts as a center of the cluster; a neighborhood module configured to identify a neighborhood of similar reference concepts for the cluster based on the cluster center; and an assignment module configured to assign one of the classification codes to the cluster based on the neighborhood.
 11. A computer-implemented method for providing visual classification suggestions for inclusion-based concept clusters, comprising the steps of: designating reference concepts each associated with a classification code; grouping one or more of the reference concepts with a plurality of uncoded concepts into a grouped concept set; generating clusters, each comprising a portion of the uncoded concepts and the reference concepts of the grouped concept set; and providing a visual suggestion for assigning one of the classification codes to one of the clusters comprising visually representing each of the reference concepts in that cluster, wherein the steps are performed by a suitably programmed computer.
 12. A method according to claim 11, further comprising: providing at least one of a presence and an absence of the concepts with each of the classification codes in that cluster; and providing a number of the concepts with each of the classification codes in that cluster, wherein the suggestion includes at least one of the number and the presence and the absence.
 13. A method according to claim 12, further comprising: providing a suggestion for classifying one of the uncoded concepts in that cluster comprising at least one of the number and the presence and the absence.
 14. A method according to claim 11, further comprising: receiving a user selection of one or more sources, custodians, content, and the classification codes of the reference concepts, wherein the reference concepts are designated in accordance with the selection.
 15. A method according to claim 11, further comprising: receiving a user selection of one or more parameters for generating the clusters, wherein the clustering is performed in accordance with the selection.
 16. A method according to claim 11, further comprising: determining a similarity between the clusters and organizing the clusters along one or more spines based on the similarity, each of the spines comprising a vector; and providing a visual classification suggestion for assigning one of the classification codes to one of the spines comprising visually representing the clusters along that spine and the reference concepts in the clusters along that spine.
 17. A method according to claim 16, further comprising: receiving a user selection of one or more of the clusters and spines; displaying one or more of the clusters and spines within a compass; emphasizing those of the clusters and spines displayed within the compass; and deemphasizing those of the clusters and spines displayed outside of the compass.
 18. A method according to claim 16, further comprising: associating a label with each of the spines, each label associated with one or more documents from which one or more of the concepts in the clusters along that spine were extracted; and displaying a list of the documents associated with one of the spine labels.
 19. A method according to claim 11, wherein the visual representation of the each classification codes comprises at least one of a symbol, shape, and color different from the visual representations of the remaining classification codes.
 20. A method according to claim 11, further comprising: identifying one of the concepts as a center of the cluster; identifying a neighborhood of similar reference concepts for the cluster based on the cluster center; and assigning one of the classification codes to the cluster based on the neighborhood.